Abstract. We review recent results about the maximal values of the
Kullback-Leibler information divergence from statistical models defined
by neural networks, including naive Bayes models, restricted Boltzmann
machines, deep belief networks, and various classes of exponential families.
We illustrate approaches to compute the maximal divergence from
a given model starting from simple sub- or super-models. We give a new
result for deep and narrow belief networks with finite-valued units.

1 Introduction

In statistical learning theory, probability models are used to infer representations
of data. In model selection it is often assumed that the model approximation
errors are negligible compared with the statistical approximation errors. This
assumption may not always be justified in practice; in some cases even
full-dimensional models fill only a small portion of the space of probability
distributions, and it is difficult to determine the general structure of the
data-generating distributions in order to constrain the possible model classes.

Here we take a complementary perspective, disregarding the statistical
approximation errors and focussing on the model approximation errors. We
quantify the model approximation error of a model $\mathcal{M}$ by the divergence
function $p \mapsto D(p\|\mathcal{M}) = \inf_{q\in\mathcal{M}} D(p\|q)$, where
$D(p\|q) = \sum_{x} p(x) \log\frac{p(x)}{q(x)}$ is the Kullback-Leibler divergence
from $p$ to $q$.$^4$ We study the maximum value of $D(\cdot\|\mathcal{M})$, which
corresponds to a worst-case analysis. The ideas from this paper can also be used to
study the expectation value given a prior on the set of target distributions, see [16].
The model approximation error can be used as a criterion for model selection. Related
ideas are discussed in [2] in the context of model design and reinforcement learning.

$^4$ We formulate our results in such a way that they are independent of the base of
the logarithm used in the definition of the divergence.
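
As a concrete illustration of the divergence defined above, the following minimal Python sketch evaluates D(p||q) for two distributions on a finite set. The function name `kl_divergence`, the use of natural logarithms, and the convention 0 log 0 = 0 are choices made for this illustration only; the minimization over q in a model, which defines D(p||M), is not attempted here.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p||q) for two probability vectors.

    Uses the convention 0 * log(0/q) = 0 and returns infinity when p puts
    mass on a state where q vanishes. Natural logarithm (illustrative choice).
    """
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0.0:
            continue  # states outside the support of p contribute nothing
        if qx == 0.0:
            return math.inf
        total += px * math.log(px / qx)
    return total

# A point mass is at divergence log(2) from the uniform distribution on two states.
print(kl_divergence([1.0, 0.0], [0.5, 0.5]), math.log(2))
```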

Most probability models with hidden variables are singular and not identifiable.
Moreover, data distributions that are not contained in these models can
have several maximum likelihood estimates. Although controlling parameter
identifiability is crucial when estimating learning coefficients in Bayesian model
selection, we will instead focus on the value of the data likelihood and the sets
of maximizing distributions, irrespective of their parameters.

In general, the function $D(\cdot\|\mathcal{M})$ has no explicit formula, making the
estimation of the maximizers and the maximum value difficult. For exponential families
the situation is slightly better, as for each distribution $p$ the divergence
$D(p\|\cdot)$ has a unique minimizer over $\mathcal{M}$. For certain families, such as
independence models and convex exponential families, there even is a closed formula
for this function. The approximation properties of various classes of exponential
families have been studied in [9, 10, 1, 18, 19, 16, 6]. The divergence from
complicated models can be estimated by finding tractable exponential subfamilies.
This idea was used in [17] to study approximation errors of restricted Boltzmann
machines.

The representational power of neural networks has been studied for many 
years and by too many authors to refer to appropriately at this place, see for 
instance [31514] . The representational power of the networks discussed in this 
paper has been studied, in particular, in |7|2 8 13 1 7|llj . 

Section 2 reviews bounds on $D_{\mathcal{M}}$ for statistical models defined by neural
networks and for exponential families. Section 3 discusses strategies to bound
$D_{\mathcal{M}}$ via sub-models and super-models, and discusses a class of exponential
families contained in restricted Boltzmann machines and deep belief networks. Section 4
puts our results in perspective.



2 Maximal information divergence 

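We consider neural networks with a set of visible units $X_1, \ldots, X_n$, where each
$X_i$ takes values in a finite set $\mathcal{X}_i$ of cardinality
$|\mathcal{X}_i| = N_i$. See Fig. 1. The visible state space of such a system is
$\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$. For any subset
$A \subseteq [n]$ let $N_A = \prod_{i \in A} N_i$ be the number of joint states of the
units indexed by $A$, and let $N = N_{[n]} = |\mathcal{X}|$. We denote the set of all
probability distributions $(p_x)_{x \in \mathcal{X}}$ on $\mathcal{X}$ by
$\Delta(\mathcal{X})$, or $\Delta$ if $\mathcal{X}$ is understood. The maximal
information divergence from a model $\mathcal{M} \subseteq \Delta(\mathcal{X})$ is
$D_{\mathcal{M}} := \max_{p \in \Delta} D(p\|\mathcal{M})$. An rI-projection of
$p \in \Delta$ onto $\mathcal{M}$ is a point $p_{\mathcal{M}}$ in the closure
$\overline{\mathcal{M}}$ of $\mathcal{M}$ with
$D(p\|\mathcal{M}) = D(p\|p_{\mathcal{M}})$.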



2.1 Probability models defined by neural networks 

The independence model $\mathcal{E}^1$ of $n$ variables $X_1, \ldots, X_n$ is the set
of probability distributions of the form $p(x) = \prod_{i \in [n]} p_i(x_i)$ for all
$x = (x_1, \ldots, x_n) \in \mathcal{X}$. This model describes non-interacting
stochastic variables. The following result is due to Ay and Knauf [1, Corollary 4.10].

Fig. 1. The naive Bayes model $\mathcal{M}_{n,k}$, the restricted Boltzmann machine
$\mathrm{RBM}_{n,m}$, and a deep belief network. Light (dark) nodes represent hidden
(visible) variables.



Lemma 1. The maximal divergence from $\mathcal{E}^1$ is bounded by

$$D_{\mathcal{E}^1} \le \log\bigl(N / \max_{i \in [n]} N_i\bigr) .$$

If all variables are $q$-ary, then $D_{\mathcal{E}^1} = (n - 1) \log(q)$, and the
maximizers are the uniform distributions on $q$-ary codes of cardinality $q$ and
minimum distance $n$.
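
The equality case of Lemma 1 can be checked numerically on a small example. The sketch below relies on the standard fact that the rI-projection of p onto the independence model is the product of the marginals of p, so that the divergence equals the multi-information. The function names and the choice n = 3, q = 4 are illustrative assumptions, not part of the original text.

```python
import math

def divergence_from_independence(p, n, q):
    """D(p || independence model) for a dict p: state tuple -> probability.

    The rI-projection of p onto the independence model is the product of
    the marginals of p, so the divergence equals the multi-information.
    """
    # Marginal distribution of each of the n variables.
    marginals = [[0.0] * q for _ in range(n)]
    for x, px in p.items():
        for i, xi in enumerate(x):
            marginals[i][xi] += px
    total = 0.0
    for x, px in p.items():
        if px > 0.0:
            prod = math.prod(marginals[i][xi] for i, xi in enumerate(x))
            total += px * math.log(px / prod)
    return total

def uniform_on_repetition_code(n, q):
    """Uniform distribution on the q constant words (a, ..., a).

    This is a q-ary code of cardinality q and minimum distance n.
    """
    return {tuple([a] * n): 1.0 / q for a in range(q)}

n, q = 3, 4
p = uniform_on_repetition_code(n, q)
# The two printed values should agree up to floating-point rounding.
print(divergence_from_independence(p, n, q), (n - 1) * math.log(q))
```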

The mixture of product distributions $\mathcal{M}_{n,k}$, or naive Bayes model, is
the graphical model on a star graph, where the leaves are visible variables, and
the internal node is a hidden variable with $k$ states.

Theorem 1. Let $A \subseteq [n]$. If $k \ge N_{[n] \setminus A}$, then
$D_{\mathcal{M}_{n,k}}$ is bounded by

$$D_{\mathcal{M}_{n,k}} \le \log\bigl(N_A / \max_{j \in A} N_j\bigr) .$$

When all visible variables are binary, we have the tighter bound

$$D_{\mathcal{M}_{n,k}} \le \Bigl(n - \lfloor \log_2(k) \rfloor - \frac{k}{2^{\lfloor \log_2(k) \rfloor}}\Bigr) \log(2) .$$

Note the similarity of the bounds given in Lemma 1 and Theorem 1. In fact,
Theorem 1 can be derived from Lemma 1 together with Lemma 6 given below.

The restricted Boltzmann machine $\mathrm{RBM}_{n,m}$ is the undirected stochastic
network with full bipartite interaction graph $K_{n,m}$, where an independent set of
$m$ units is hidden, and an independent set of $n$ units is visible.

Theorem 2. Let $A \subseteq [n]$, and let $M_1, \ldots, M_m$ be the sizes of the state
spaces of the hidden variables. If $1 + \sum_{j \in [m]} (M_j - 1) \ge N_{[n] \setminus A}$,
then

$$D_{\mathrm{RBM}_{n,m}} \le \log\bigl(N_A / \max_{j \in A} N_j\bigr) .$$

When all units are binary, and $m \le 2^{n-1} - 1$, we have the tighter bound

$$D_{\mathrm{RBM}_{n,m}} \le \Bigl(n - \lfloor \log_2(m+1) \rfloor - \frac{m+1}{2^{\lfloor \log_2(m+1) \rfloor}}\Bigr) \log(2) .$$



Theorem 2 subsumes divergence bounds for naive Bayes models (when $m = 1$)
and independence models (when $m = 0$). This result was shown in the binary
case in [17, Theorem 2] and in the non-binary case in [15, Theorem 29].
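
To get a feeling for the binary bound in Theorem 2, the following sketch evaluates (n - floor(log2(m+1)) - (m+1)/2^floor(log2(m+1))) log(2) for a few values of m. The function name is hypothetical, the result is in nats, and the range of m is read here as including the endpoint m = 2^(n-1) - 1, which is an assumption of this sketch. Setting k = m + 1 gives the corresponding binary bound of Theorem 1.

```python
import math

def binary_rbm_divergence_bound(n, m):
    """Upper bound from Theorem 2 for the binary RBM with n visible, m hidden units.

    Evaluates (n - floor(log2(m+1)) - (m+1) / 2**floor(log2(m+1))) * log(2),
    assuming 0 <= m <= 2**(n-1) - 1 (endpoint inclusion is an assumption).
    """
    if not 0 <= m <= 2 ** (n - 1) - 1:
        raise ValueError("bound is evaluated only for 0 <= m <= 2**(n-1) - 1")
    k = m + 1
    floor_log2 = k.bit_length() - 1  # exact floor(log2(k)) for integers k >= 1
    return (n - floor_log2 - k / 2 ** floor_log2) * math.log(2)

# With n = 4 visible units the bound decreases from 3*log(2) at m = 0
# (the independence model bound of Lemma 1) to 0 at m = 7.
for m in (0, 1, 3, 7):
    print(m, binary_rbm_divergence_bound(4, m))
```

Using `int.bit_length` avoids floating-point rounding when m + 1 is an exact power of two.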

A deep belief network (DBN) is a layered stochastic network with undirected
bipartite interactions between the units in the deepest two layers, which
form an RBM, and directed bipartite interactions between all other pairs of
subsequent layers, directed towards the first layer, which is the only visible layer.

Theorem 3. Consider a DBN with $L$ layers, each layer containing $n$ units with
state spaces of cardinalities $q_1, \ldots, q_n$. Let $m$ be any integer with
$\prod_{j=m+2}^{n} q_j \le m \le n$, and let $q_1 \ge \cdots \ge q_m$. If
$L \ge 2 + \frac{q_1^S - 1}{q_1 - 1}$ for some $S \in \{0, 1, \ldots, m\}$, then

$$D_{\mathrm{DBN}} \le \log(N_{[m-S]}) .$$

In particular, when all units are binary and the network has $L \ge 1 + 2^S$ layers
of size $n = 2^{k-1} + k$, for some $S \in \{0, 1, \ldots, 2^{k-1}\}$, then

$$D_{\mathrm{DBN}} \le (2^{k-1} - S) \log(2) .$$

The binary case is [13, Theorem 2], together with [14, Theorem 18]. The
non-binary case is new (details in [12]).

The bounds in Theorems 1, 2, and 3 vanish when the number of hidden
units is large enough (depending on their state spaces). In this case, the models
can approximate all probability distributions on the states of their visible units
arbitrarily well, i.e., they are universal approximators.

All these theorems can be proved using the same strategy: First, a family of
exponential sub-models is identified, and then the divergence from the union of
these sub-models is bounded from above, as in Theorem 8 below.

2.2 Exponential families 

Exponential families are widely used statistical models. Examples include
log-linear models, hierarchical models, and independence models. The information
divergence maximization problem is far better understood for exponential
families than for other probability models. We use exponential families to
approximate probability models with hidden variables.

Let $\varrho = \{A_1, \ldots, A_m\}$ be a partition of $\mathcal{X}$. The partition
model $\mathcal{P}_\varrho$ consists of all $p \in \Delta$ with $p(x) = p(y)$ whenever
$x, y$ belong to the same block of $\varrho$; that is, the conditional distribution
of $p$, conditioned on any block $A_i \in \varrho$,
equals the uniform distribution. Partition models are the convex exponential 
families that contain the uniform distribution. The following is a special case 
of Corollary 1]: 

Lemma 2. Let $\varrho = \{A_1, \ldots, A_m\}$ be a partition of $\mathcal{X}$, and
denote by $c(\varrho) = \max_{i \in [m]} |A_i|$ the coarseness of $\varrho$. Then
$D_{\mathcal{P}_\varrho} = \log(c(\varrho))$, and the global maximizers are the
distributions $p$ with $|\mathrm{supp}(p) \cap A_i| \le 1$ for all $i \in [m]$, where
$|\mathrm{supp}(p) \cap A_i| = 1$ holds only if $|A_i| = c(\varrho)$.
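
Lemma 2 is easy to verify on a toy example. The sketch below uses the fact that the rI-projection onto a partition model spreads the mass p(A_i) uniformly over each block A_i. The function name, the state labels 0 to 6, and the particular partition are illustrative assumptions.

```python
import math

def divergence_from_partition_model(p, blocks):
    """D(p || partition model) for p: dict state -> probability.

    blocks is a list of disjoint lists of states covering the state space.
    The rI-projection spreads the mass p(A_i) uniformly over each block A_i.
    """
    total = 0.0
    for block in blocks:
        mass = sum(p.get(x, 0.0) for x in block)
        for x in block:
            px = p.get(x, 0.0)
            if px > 0.0:
                total += px * math.log(px * len(block) / mass)
    return total

blocks = [[0, 1, 2], [3, 4, 5], [6]]  # coarseness c = 3
p_max = {0: 0.5, 3: 0.5}              # one point in each block of maximal size
p_other = {0: 0.5, 6: 0.5}            # puts mass on a block smaller than c

# The first value should agree with log(3); the second should be smaller.
print(divergence_from_partition_model(p_max, blocks), math.log(3))
print(divergence_from_partition_model(p_other, blocks))
```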



The maximal divergence from any exponential family of dimension k can be 
bounded from below as follows, see [EH Theorem 28]: 

Theorem 4. Let $\mathcal{E}$ be an exponential family of dimension $k$. Then

$$D_{\mathcal{E}} \ge \log(N) - \log(k + 1) .$$

If equality holds, then $\mathcal{E}$ is a partition model with homogeneous partition.
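
The equality case can be checked directly against Lemma 2. As a short worked example, assume that k + 1 divides N and take a partition of the state space into k + 1 blocks of equal size N/(k+1). The corresponding partition model is parametrized by the k + 1 block masses, so it has dimension k, and Lemma 2 gives:

```latex
\[
  D_{\mathcal{P}_\varrho}
  = \log c(\varrho)
  = \log\frac{N}{k+1}
  = \log(N) - \log(k+1) .
\]
```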

Probability models defined as marginals of exponential families can behave
very differently from proper exponential families. Any finite subset of $\Delta$ can
be embedded in a projection of a two-dimensional exponential family, see [2]:

Lemma 3. Given any finite set of probability distributions
$\{p^{(i)}\}_{i=1}^{K} \subseteq \Delta_{N-1}$, there is a two-dimensional exponential
family $\mathcal{E} \subseteq \Delta_{K-1}$, and a linear map
$\psi \colon \Delta_{K-1} \to \Delta_{N-1}$, such that
$\psi(\mathcal{E}) \supseteq \{p^{(i)}\}_{i=1}^{K}$.

3 Estimating the information divergence 
3.1 Subfamilies and superfamilies 

If $\mathcal{M}' \subseteq \mathcal{M}$ then $D_{\mathcal{M}} \le D_{\mathcal{M}'}$. In special cases it is possible to have equality.

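Lemma 4. If $\mathcal{M}' \subseteq \mathcal{M}$ and if $p$ is a maximizer of the
divergence from $\mathcal{M}$ such that $\mathcal{M}'$ contains an rI-projection
$p_{\mathcal{M}}$ of $p$ to $\mathcal{M}$, then $p$ maximizes the divergence from
$\mathcal{M}'$ among the set
$\{q \in \Delta : q_{\mathcal{M}} \in \mathcal{M}' \text{ for some rI-projection } q_{\mathcal{M}}\}$.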

The lemma is useful for exponential families, where the set of distributions whose
rI-projection to $\mathcal{M}$ lies in $\mathcal{M}'$ can be parametrized via
$\mathcal{M}' + \mathcal{N}$, where $\mathcal{N}$ is the normal space of
$\mathcal{M}$. The following argument due to Juricek [6] is an example:

Let $\mathcal{M} = \mathcal{E}^1$ be the independence model of $n$ $q$-ary variables
and let $\mathcal{M}'$ be the set of i.i.d. distributions. By Lemma 1 the uniform
distribution $p$ on the states $(1, \ldots, 1), (2, \ldots, 2), \ldots, (q, \ldots, q)$
maximizes the divergence from $\mathcal{M}$, and it is exchangeable. Since the
rI-projections of the set of exchangeable distributions to $\mathcal{M}$ belong to
$\mathcal{M}'$, Lemma 4 implies that $p$ maximizes the divergence from $\mathcal{M}'$
among the exchangeable distributions, with divergence
$D(p\|\mathcal{M}') = (n - 1) \log(q)$. Now, $\mathcal{M}'$ as a subset of the
exchangeable simplex can be identified with the multinomial model. This proves the
following result [6, Theorem 1.1].

Theorem 5. The maximal divergence from the multinomial model of n q-ary 
variables is equal to (n — 1) log(q). 

Conversely, simple subfamilies can be used to study larger models: 

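Lemma 5. Let $\mathcal{E}$ be an exponential family. Let $\mathcal{M}_i$ be a
sub-model of $\mathcal{E}$ with $D_{\mathcal{M}_i} = K$ and divergence maximizers
$\mathcal{G}_i$, for all $i \in [k]$. If there is a point
$p \in \mathcal{G} = \bigcap_i \mathcal{G}_i$ with
$p_{\mathcal{E}} \in \bigcup_i \mathcal{M}_i$, then $D_{\mathcal{E}} = K$ and the
divergence maximizers are exactly the points in $\mathcal{G}$ whose rI-projections
onto $\mathcal{E}$ lie in $\bigcap_i \mathcal{M}_i$.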



Lemma 5 can be used to prove the homogeneous case of Lemma 1 as follows:
The independence model of $n$ $q$-ary variables contains the partition model
$\mathcal{P}_i$ with partition blocks $\{x : x_i = y_i\}$ for all
$y_i \in \mathcal{X}_i$, for any $i \in [n]$. By Lemma 2 the maximal divergence from
the partition model $\mathcal{P}_i$ is $D_{\mathcal{P}_i} = (n - 1) \log(q)$, and
the set of maximizers is the set $\mathcal{G}_i$ of distributions $p$ whose support
$\mathrm{supp}(p) = \{x^{(j)}\}_j$ satisfies $x_i^{(j)} \ne x_i^{(j')}$ for all
$j \ne j'$. The intersection $\mathcal{G} = \bigcap_i \mathcal{G}_i$ is the set
of probability distributions with support on a code of minimum distance $n$. The
rI-projection of an arbitrary element $p \in \mathcal{G}$ lies in
$\bigcap_i \mathcal{P}_i = \{u\}$ if and only if $p$ is a uniform distribution on a
code of minimum distance $n$ and cardinality $q$. By Lemma 5 these are the global
divergence maximizers from $\mathcal{E}^1$.

3.2 Mixtures of exponential families with disjoint supports 

The mixture $\mathrm{Mixt}(\mathcal{M}_1, \ldots, \mathcal{M}_k)$ of $k$ models
$\mathcal{M}_1, \ldots, \mathcal{M}_k \subseteq \Delta$ is the set of probability
distributions of the form $p = \sum_{i=1}^{k} \lambda_i p^{(i)}$, where
$\lambda \in \Delta_{k-1}$ and $p^{(i)} \in \mathcal{M}_i$ for all $i$. In general,
mixtures are difficult to describe, even for simple models
$\mathcal{M}_1, \ldots, \mathcal{M}_k$. The situation is much simpler when mixing
models supported on disjoint subsets of $\mathcal{X}$:

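Lemma 6. Let $\{A_1, \ldots, A_k\}$ be a partition of $\mathcal{X}$ and let
$\mathcal{M}_1, \ldots, \mathcal{M}_k$ be statistical models with
$\mathcal{M}_i \subseteq \Delta(A_i)$. For any $p \in \Delta(\mathcal{X})$, the
rI-projections of $p$ to $\mathrm{Mixt}(\mathcal{M}_1, \ldots, \mathcal{M}_k)$ are the
distributions of the form

$$p_{\mathcal{M}}(x) = p(A_i)\, p_{\mathcal{M}_i}(x), \quad \text{for all } x \in A_i, \text{ for all } i \in [k],$$

where $p_{\mathcal{M}_i}$ denotes an rI-projection of $p(x \mid A_i)$ to
$\mathcal{M}_i$ for all $i \in [k]$.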
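
A short computation, included here as a sketch, shows how Lemma 6 decomposes the divergence block by block. With the projection from the lemma, and summing only over blocks with positive mass p(A_i) > 0:

```latex
\begin{align*}
D(p \,\|\, p_{\mathcal{M}})
  &= \sum_{i \in [k]} \sum_{x \in A_i} p(x)
     \log \frac{p(x)}{p(A_i)\, p_{\mathcal{M}_i}(x)} \\
  &= \sum_{i \in [k]} p(A_i) \sum_{x \in A_i} p(x \mid A_i)
     \log \frac{p(x \mid A_i)}{p_{\mathcal{M}_i}(x)} \\
  &= \sum_{i \in [k]} p(A_i)\, D\bigl(p(\cdot \mid A_i) \,\|\, \mathcal{M}_i\bigr) .
\end{align*}
```

The right-hand side is a convex combination of the block-wise divergences, so its maximum over p equals the largest of the maximal divergences from the individual models, attained by concentrating p on a corresponding block.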

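We call a set $\mathcal{Y} \subseteq \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$
cubical if it can be written as a product
$\mathcal{Y} = \mathcal{Y}_1 \times \cdots \times \mathcal{Y}_n$ with
$\mathcal{Y}_i \subseteq \mathcal{X}_i$ for all $i \in [n]$. A set $\mathcal{Y}$ is
cubical iff there exists a product distribution $p$ with
$\mathrm{supp}(p) := \{x \in \mathcal{X} : p(x) > 0\} = \mathcal{Y}$ (in this case
$\mathcal{Y}_i = \mathrm{supp}(p_i)$). We call a partition cubical if it consists of
cubical blocks. For any cubical set $\mathcal{Y}$ let $\mathcal{E}^1_{\mathcal{Y}}$
denote the set of product distributions with support $\mathcal{Y}$.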

Let $\varrho = \{A_1, \ldots, A_k\}$ be a cubical partition of $\mathcal{X}$. The
mixture of products with disjoint supports $\varrho$ is the model
$\mathcal{M}_\varrho := \mathrm{Mixt}(\mathcal{E}^1_{A_1}, \ldots, \mathcal{E}^1_{A_k}) \subseteq \mathcal{M}_{n,k}$.
For this kind of model, Lemmas 1 and 6 show:

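Corollary 1. Let $\varrho = \{A_1, \ldots, A_k\}$ be a cubical partition of
$\mathcal{X}$ with blocks
$A_i = \mathcal{Y}_{i,1} \times \cdots \times \mathcal{Y}_{i,n}$ with
$|\mathcal{Y}_{i,j}| = s_i$ for all $j \in [n]$, for all $i \in [k]$. Then

$$D_{\mathcal{M}_\varrho} = \max_{i \in [k]} \log\bigl(|A_i| / s_i\bigr) .$$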

3.3 Unions of exponential families 

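Let $\mathcal{M}^*_{n,k} = \bigcup_{\varrho : |\varrho| = k} \mathcal{M}_\varrho \subseteq \mathcal{M}_{n,k}$
be the union of mixtures of products with disjoint supports $\varrho$, where
$\varrho$ runs over all cubical partitions of $\mathcal{X}$ with $k$ blocks.
The set $\mathcal{M}^*_{n,k}$ is not an exponential family, but a finite union of
exponential families. Similarly, let
$\mathcal{P}^*_{n,k} = \bigcup_{\varrho : |\varrho| = k} \mathcal{P}_\varrho$ be the
union of all partition models $\mathcal{P}_\varrho$ of partitions $\varrho$ with $k$
cubical blocks.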

Our motivation for studying unions of mixture models and unions of partition
models comes from the following two results. For simplicity, we consider binary
units; analogous results for non-binary units can be found in [15] and [12].

Theorem 6 ([17, Theorem 1]). The binary model $\mathrm{RBM}_{n,m}$ contains any
mixture of one arbitrary product distribution, $m - k$ product distributions with
mutually disjoint supports, and $k$ distributions with support on any edges of the
$n$-cube, for any $0 \le k \le m$. In particular, $\mathrm{RBM}_{n,m}$ contains
$\mathcal{M}^*_{n,m+1}$.

Theorem 7 ([14, Theorem 17]). Let $L \in \mathbb{N}$, let $k$ be the largest integer
for which $L \ge 1 + 2^{2^{k-1}}$, and let $K = 2^{k-1} + k \le n$. The binary deep
belief network model with $L$ layers of width $n$ contains any partition model
$\mathcal{P}_\varrho$ with partition
$\varrho = \{\{x : x_\lambda = y_\lambda\} : y_\lambda \in \{0,1\}^K\}$, where
$\lambda \subseteq [n]$, $|\lambda| = K$.

Unions of exponential families are more difficult to describe than exponential
families, but the maximal divergence from them can be bounded as follows:

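Theorem 8. Let $\mathcal{X} = \{0, 1\}^n$. If $k \le 2^{n-1}$, then

$$D_{\mathcal{M}^*_{n,k}} \le \Bigl(n - \lfloor \log_2(k) \rfloor - \frac{k}{2^{\lfloor \log_2(k) \rfloor}}\Bigr) \log(2) .$$

If $k \le 2^n$, then

$$D_{\mathcal{P}^*_{n,k}} \le \Bigl(n + 1 - \lfloor \log_2(k) \rfloor - \frac{k}{2^{\lfloor \log_2(k) \rfloor}}\Bigr) \log(2) .$$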

The first part was shown in [17, Theorem 2]. The second part can be proved with
a direct adaptation of the same proof. Theorem 8 together with Theorems 6
and 7 proves the 'tighter bounds' in Theorems 1 and 2.

4 Discussion 

When we plot the approximation error bounds of the model classes discussed
here against the corresponding number of model parameters, we find that they
all behave similarly; they all decay logarithmically on a large scale. This is the
optimal maximal approximation error behaviour of exponential families
(Theorem 4). The bounds for partition models, homogeneous independence models,
and mixtures of products with disjoint homogeneous supports are tight. The
naive Bayes model bound is tight for many choices of the $N_i$ in the sense that it
vanishes iff the model is a universal approximator, see [11]. The other bounds for
the more complicated models are probably not tight. It is reasonable to expect
that, fixing the number of parameters, models with many hidden units fill the
probability simplex more evenly than their counterparts with fewer or no hidden
units (see, e.g., Lemma 3). For the discussed model classes, this paper does
not give conclusive answers in that direction, since the only maximal divergence
lower bounds are for exponential families. It should be mentioned, however, that
the mere existence of universal approximators within a given class of networks
is not always obvious and sometimes false. For example, DBNs with too narrow
hidden layers are never universal approximators, regardless of their parameter
count.